Longquan celadon: a quantified archaeological analysis of a pan-Indian Ocean industry of the 12th to 15th centuries
This paper examines the Longquan celadon industry, located in Zhejiang province in China, which flourished mainly between the Southern Song and early Ming dynasties (12th to 15th century). The products of this industry are found on archaeological sites from across China and the Indian Ocean. This paper attempts a quantified analysis of the development of the industry based on archaeological data, focussing on four aspects: production, domestic consumption, overseas consumption and, to a lesser degree, workshop organisation. Although much of the data is still, in many ways, problematic, and many of the conclusions drawn are necessarily tentative, it is possible to demonstrate the value and timeliness of the approach by charting the overall development of this industry and by arguing that the close integration of the four aspects examined indicates that the Longquan celadon industry was an industry of considerable economic significance across much of the Indian Ocean.
DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field
Reconstructing hand-held objects from a single RGB image is an important and
challenging problem. Existing works utilizing Signed Distance Fields (SDF)
reveal limitations in comprehensively capturing the complex hand-object
interactions, since SDF is only reliable within the proximity of the target,
and hence, infeasible to simultaneously encode local hand and object cues. To
address this issue, we propose DDF-HO, a novel approach leveraging Directed
Distance Field (DDF) as the shape representation. Unlike SDF, DDF maps a ray in
3D space, consisting of an origin and a direction, to corresponding DDF values,
including a binary visibility signal determining whether the ray intersects the
objects and a distance value measuring the distance from origin to target in
the given direction. We randomly sample multiple rays and collect local to
global geometric features for them by introducing a novel 2D ray-based feature
aggregation scheme and a 3D intersection-aware hand pose embedding, combining
2D-3D features to model hand-object interactions. Extensive experiments on
synthetic and real-world datasets demonstrate that DDF-HO consistently
outperforms all baseline methods by a large margin, especially under Chamfer
Distance, with an improvement of about 80%. Code and trained models will be
released soon.
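To make the DDF definition in the abstract above concrete, the following is a minimal sketch of a directed distance field for an analytic sphere. It is not the DDF-HO network: the function name `sphere_ddf` and the choice of a unit sphere are assumptions made purely for illustration. It shows only the mapping the abstract describes, from a ray (origin and direction) to a binary visibility signal and a distance along that ray.

```python
import math

def sphere_ddf(origin, direction, center=(0.0, 0.0, 0.0), radius=1.0):
    """Toy directed distance field for a sphere (illustrative assumption).

    Maps a ray to (visible, distance): whether the ray hits the sphere,
    and how far from the origin the first hit lies along the direction.
    """
    # Normalise the direction so the distance is measured in world units.
    norm = math.sqrt(sum(x * x for x in direction))
    d = [x / norm for x in direction]
    oc = [o - c for o, c in zip(origin, center)]

    # Solve |oc + t * d|^2 = radius^2 for t (a quadratic with unit leading term).
    b = sum(x * y for x, y in zip(oc, d))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return False, math.inf  # the ray's line misses the sphere entirely

    root = math.sqrt(disc)
    t_near, t_far = -b - root, -b + root
    t = t_near if t_near >= 0.0 else t_far  # origin may lie inside the sphere
    if t < 0.0:
        return False, math.inf  # the sphere is behind the ray origin
    return True, t

# The same origin yields different values for different directions,
# which is what distinguishes a DDF from a point-only SDF.
print(sphere_ddf((0.0, 0.0, -3.0), (0.0, 0.0, 1.0)))
print(sphere_ddf((0.0, 0.0, -3.0), (0.0, 1.0, 0.0)))
```

Because the value depends on the direction as well as the origin, a ray that starts far from the surface still returns a meaningful distance, whereas the abstract notes that an SDF is only reliable close to the target.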
OPA-3D: Occlusion-Aware Pixel-Wise Aggregation for Monocular 3D Object Detection
Despite monocular 3D object detection having recently made a significant leap
forward thanks to the use of pre-trained depth estimators for pseudo-LiDAR
recovery, such two-stage methods typically suffer from overfitting and are
incapable of explicitly encapsulating the geometric relation between depth and
object bounding box. To overcome this limitation, we instead propose OPA-3D, a
single-stage, end-to-end, Occlusion-Aware Pixel-Wise Aggregation network that
jointly estimates dense scene depth with depth-bounding box residuals and
object bounding boxes, allowing a two-stream detection of 3D objects and leading
to significantly more robust detections. The first stream, denoted
the Geometry Stream, combines visible depth and depth-bounding box residuals
to recover the object bounding box via explicit occlusion-aware optimization.
In addition, a bounding box based geometry projection scheme is employed in an
effort to enhance distance perception. The second stream, named the Context
Stream, directly regresses 3D object location and size. This novel two-stream
representation further enables us to enforce cross-stream consistency terms
that align the outputs of both streams, improving the overall performance.
Extensive experiments on the public benchmark demonstrate that OPA-3D
outperforms state-of-the-art methods on the main Car category, whilst keeping a
real-time inference speed. We plan to release all code and trained models
soon.
CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs
Controllable scene synthesis aims to create interactive environments for
various industrial use cases. Scene graphs provide a highly suitable interface
to facilitate these applications by abstracting the scene context in a compact
manner. Existing methods, reliant on retrieval from extensive databases or
pre-trained shape embeddings, often overlook scene-object and object-object
relationships, leading to inconsistent results due to their limited generation
capacity. To address this issue, we present CommonScenes, a fully generative
model that converts scene graphs into corresponding controllable 3D scenes,
which are semantically realistic and conform to commonsense. Our pipeline
consists of two branches, one predicting the overall scene layout via a
variational auto-encoder and the other generating compatible shapes via latent
diffusion, capturing global scene-object and local inter-object relationships
while preserving shape diversity. The generated scenes can be manipulated by
editing the input scene graph and sampling the noise in the diffusion model.
Because no existing scene graph dataset offers high-quality object-level meshes
with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor
dataset 3D-FRONT with additional scene graph labels. Extensive experiments are
conducted on SG-FRONT where CommonScenes shows clear advantages over other
methods regarding generation consistency, quality, and diversity. Code and the
dataset will be released upon acceptance.
CCD-3DR: Consistent Conditioning in Diffusion for Single-Image 3D Reconstruction
We present a novel shape reconstruction method that leverages a
diffusion model to generate a sparse 3D point cloud for the object captured in a
single RGB image. Recent methods typically leverage global embedding or local
projection-based features as the condition to guide the diffusion model.
However, such strategies fail to consistently align the denoised point cloud
with the given image, leading to unstable conditioning and inferior
performance. In this paper, we present CCD-3DR, which exploits a novel centered
diffusion probabilistic model for consistent local feature conditioning. We
constrain the noise and sampled point cloud from the diffusion model into a
subspace where the point cloud center remains unchanged during the forward
diffusion process and reverse process. The stable point cloud center further
serves as an anchor to align each point with its corresponding local
projection-based features. Extensive experiments on the synthetic benchmark
ShapeNet-R2N2 demonstrate that CCD-3DR outperforms all competitors by a large
margin, with over 40% improvement. We also provide results on the real-world
dataset Pix3D to thoroughly demonstrate the potential of CCD-3DR in real-world
applications. Code will be released soon.
Comment: 11 pages
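The centering constraint described in the abstract above can be illustrated with a small sketch. This is not the CCD-3DR implementation: the helper names (`centered_noise`, `forward_step`, `centroid`), the standard DDPM-style forward step, and the assumption that the clean point cloud has already been translated so its centroid sits at the origin are all choices made here for illustration. The point shown is that subtracting the per-sample mean from the Gaussian noise keeps the noised cloud's centroid fixed.

```python
import math
import random

def centroid(points):
    """Mean of a list of 3D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def centered_noise(num_points, rng):
    """Gaussian noise projected onto the zero-mean subspace (assumed scheme)."""
    noise = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(num_points)]
    mean = centroid(noise)
    return [[p[k] - mean[k] for k in range(3)] for p in noise]

def forward_step(points, alpha_bar, rng):
    """DDPM-style noising: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

    With zero-mean eps and an origin-centred x_0 (assumption), the centroid
    of x_t stays at the origin for every noise level alpha_bar.
    """
    eps = centered_noise(len(points), rng)
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[a * p[k] + s * e[k] for k in range(3)] for p, e in zip(points, eps)]

rng = random.Random(0)
cloud = [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]]
noisy = forward_step(cloud, alpha_bar=0.5, rng=rng)
print(centroid(noisy))  # each coordinate is zero up to floating-point error
```

A fixed centroid gives every noised point a stable frame of reference, which is the property the abstract relies on to align each point with its local projection-based features.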
On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks
Learning-based methods to solve dense 3D vision problems typically train on
3D sensor data. Each sensor's distance-measuring principle has its own
advantages and drawbacks. These are typically neither compared nor discussed in the
literature due to a lack of multi-modal datasets. Texture-less regions are
problematic for structure from motion and stereo, reflective material poses
issues for active sensing, and distances for translucent objects are difficult
to measure with existing hardware. Training on inaccurate or corrupt data
induces model bias and hampers generalisation capabilities. These effects
remain unnoticed if the sensor measurement is considered as ground truth during
the evaluation. This paper investigates the effect of sensor errors for the
dense 3D vision tasks of depth estimation and reconstruction. We rigorously
show the significant impact of sensor characteristics on the learned
predictions and notice generalisation issues arising from various technologies
in everyday household environments. For evaluation, we introduce a carefully
designed dataset (available at
https://github.com/Junggy/HAMMER-dataset) comprising measurements from
commodity sensors, namely D-ToF, I-ToF, passive/active stereo, and monocular
RGB+P. Our study quantifies the considerable sensor noise impact and paves the
way to improved dense vision estimates and targeted data fusion.
Comment: Accepted at CVPR 2023, Main Paper + Supp. Mat. arXiv admin note:
substantial text overlap with arXiv:2205.0456
HouseCat6D -- A Large-Scale Multi-Modal Category Level 6D Object Pose Dataset with Household Objects in Realistic Scenarios
Estimating the 6D pose of objects is a major 3D computer vision problem.
Following the promising results of instance-level approaches, research is
also moving towards category-level pose estimation for more practical application
scenarios. However, unlike well-established instance-level pose datasets,
available category-level datasets fall short in annotation quality and in the
number of annotated poses. We propose the new category-level 6D pose dataset HouseCat6D
featuring 1) Multi-modality of Polarimetric RGB and Depth (RGBD+P), 2) 194 highly
diverse objects from 10 household object categories including 2
photometrically challenging categories, 3) High-quality pose annotation with an
error range of only 1.35 mm to 1.74 mm, 4) 41 large-scale scenes with extensive
viewpoint coverage and occlusions, 5) Checkerboard-free environment throughout
the entire scene, and 6) Additionally annotated dense 6D parallel-jaw grasps.
Furthermore, we also provide benchmark results of state-of-the-art
category-level pose estimation networks.